R Advanced Course

Machine learning algorithms in R - II

Objective and contents

Objective: to work with machine learning algorithms in R to classify and predict spatial variables.

  • Exercise - machine learning for land cover classification
  • Exercise - machine learning for spatial prediction of variables

Exercise - machine learning for land cover classification

Land cover classification

For this exercise we will use a Landsat 8 image from Chile and a shapefile with polygons of different classes for training purposes. Let’s load the terra, e1071, and randomForest packages.

library(terra)
library(randomForest)
library(e1071)
file_paths <- "C:/User/L4_Machine_Learning_II/LANDSAT_8_C1/"
files      <- list.files(file_paths, full.names = TRUE)
covariates <- rast(files)

Land cover classification

Let’s print and plot the covariates object.

print(covariates)
## class       : SpatRaster 
## dimensions  : 893, 1032, 7  (nrow, ncol, nlyr)
## resolution  : 30, 30  (x, y)
## extent      : 274095, 305055, -4274535, -4247745  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 19N (EPSG:32619) 
## sources     : LC08_L1TP_232087_20200408_20200422_01_T1_sr_band1.tif  
##               LC08_L1TP_232087_20200408_20200422_01_T1_sr_band2.tif  
##               LC08_L1TP_232087_20200408_20200422_01_T1_sr_band3.tif  
##               ... and 4 more source(s)
## names       : band ~tance, band ~tance, band ~tance, band ~tance, band ~tance, band ~tance, ... 
## min values  :        -304,        -293,         -44,         -66,          34,         -13, ... 
## max values  :        6741,        7354,        8243,        8969,        7858,        5895, ...

Land cover classification

plot(covariates[[1]])

Loading training set

Now we can load the shapefile that will be used for training purposes.

training <- "C:/User/Shapefiles/training_sites_LCC.shp"
training <- vect(training)
print(training)
##  class       : SpatVector 
##  geometry    : polygons 
##  dimensions  : 29, 2  (geometries, attributes)
##  extent      : 274172.4, 305050.1, -4274117, -4248031  (xmin, xmax, ymin, ymax)
##  source      : training_sites_LCC.shp
##  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
##  names       :    id Class
##  type        : <int> <int>
##  values      :     2   900
##                    3   900
##                    4   900

Loading training set

Let’s visualise the attribute table of the training shapefile.

data.frame(training)
##    id Class
## 1   2   900
## 2   3   900
## 3   4   900
## 4   1   900
## 5   5   100
## 6   6   100
## 7   7   100
## 8   8   400
## 9   9   400
## 10 10   400
## 11 11   400
## 12 12   400
## 13 13  1000
## 14 14   300
## 15 15   300
## 16 16   300
## 17 17   300
## 18 18   200
## 19 19   200
## 20 20   200
## 21 21   200
## 22 22   600
## 23 23   600
## 24 24   500
## 25 25   500
## 26 26   600
## 27 27   800
## 28 28   800
## 29 29   800

Loading training set

What are the classes?

Class | Description
100   | crops
200   | forest
300   | grassland
400   | shrubland
500   | wetland
600   | water
800   | urban
900   | barren land
1000  | ice_snow

Land cover classification

plot(training, axes = TRUE)

Additional indices

Besides the different Landsat bands, we can calculate the Normalised Difference Vegetation Index (NDVI) and the Normalised Difference Water Index (NDWI) so we can use them in the classification as well.

\[ NDVI = (NIR - Red) / (NIR + Red) \]

\[ NDWI = (Green - NIR) / (Green + NIR) \]
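As a minimal numeric sketch (plain R values standing in for raster bands), the two indices can be written as small functions; `ndvi_fun` and `ndwi_fun` are illustrative names, not part of any package:

```r
# Normalised difference of two bands: (a - b) / (a + b)
ndvi_fun <- function(nir, red)   (nir - red) / (nir + red)
ndwi_fun <- function(green, nir) (green - nir) / (green + nir)

# Toy surface reflectance values for a vegetated pixel
ndvi_fun(nir = 4000, red = 800)    # clearly positive -> vegetated
ndwi_fun(green = 600, nir = 4000)  # negative -> not open water
```

Both indices are bounded between -1 and 1, which is why they complement the raw bands well as covariates.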

Landsat bands

Band                         | Wavelength (micrometers) | Resolution (m)
Band 1 - Coastal aerosol     | 0.43-0.45                | 30
Band 2 - Blue                | 0.45-0.51                | 30
Band 3 - Green               | 0.53-0.59                | 30
Band 4 - Red                 | 0.64-0.67                | 30
Band 5 - Near Infrared (NIR) | 0.85-0.88                | 30
Band 6 - SWIR 1              | 1.57-1.65                | 30
Band 7 - SWIR 2              | 2.11-2.29                | 30

Computation of indices

# NDVI 
ndvi <- (covariates[[5]] - covariates[[4]]) / (covariates[[5]] + covariates[[4]])
names(ndvi) <- "ndvi"

# NDWI

ndwi <- (covariates[[3]] - covariates[[5]]) / (covariates[[3]] + covariates[[5]])
names(ndwi) <- "ndwi"

NDVI

plot(ndvi, axes = TRUE)

NDWI

plot(ndwi, axes = TRUE)

Adding additional covariates

Once we have calculated the NDVI and NDWI layers, we can add them to the covariates using the add function from terra.

add(covariates) <- c(ndvi, ndwi)
names(covariates)
## [1] "band 1 surface reflectance" "band 2 surface reflectance"
## [3] "band 3 surface reflectance" "band 4 surface reflectance"
## [5] "band 5 surface reflectance" "band 6 surface reflectance"
## [7] "band 7 surface reflectance" "ndvi"                      
## [9] "ndwi"

Visualisation of NDVI and training set

plot(ndvi)
plot(training, col = "cyan", add = TRUE)

Preparing training data

To train the RF and SVM models, we have to create a data.frame object that contains the variable we want to predict together with the respective covariates. For this purpose, we will use the extract function from the terra package.

training_df <- terra::extract(covariates, training)
head(training_df)
##   ID band 1 surface reflectance band 2 surface reflectance
## 1  1                        415                        554
## 2  1                        571                        710
## 3  1                        392                        496
## 4  1                        253                        318
## 5  1                        384                        517
## 6  1                        386                        491
##   band 3 surface reflectance band 4 surface reflectance
## 1                        822                       1009
## 2                        974                       1168
## 3                        717                        903
## 4                        496                        646
## 5                        799                       1057
## 6                        797                        959
##   band 5 surface reflectance band 6 surface reflectance
## 1                       1131                       1293
## 2                       1286                       1397
## 3                        949                       1090
## 4                        681                        767
## 5                       1186                       1293
## 6                       1055                       1259
##   band 7 surface reflectance       ndvi       ndwi
## 1                        994 0.05700935 -0.1582181
## 2                       1096 0.04808476 -0.1380531
## 3                        879 0.02483801 -0.1392557
## 4                        629 0.02637528 -0.1571793
## 5                       1028 0.05751226 -0.1949622
## 6                        969 0.04766634 -0.1393089
summary(training_df)
##        ID        band 1 surface reflectance band 2 surface reflectance
##  Min.   : 1.00   Min.   :   2.0             Min.   :  17.0            
##  1st Qu.: 4.00   1st Qu.: 185.0             1st Qu.: 226.0            
##  Median : 9.00   Median : 261.0             Median : 312.0            
##  Mean   : 9.48   Mean   : 596.9             Mean   : 685.3            
##  3rd Qu.:14.00   3rd Qu.: 408.0             3rd Qu.: 485.0            
##  Max.   :29.00   Max.   :5784.0             Max.   :6488.0            
##  band 3 surface reflectance band 4 surface reflectance
##  Min.   :  41               Min.   :  -1.0            
##  1st Qu.: 322               1st Qu.: 327.0            
##  Median : 428               Median : 499.0            
##  Mean   : 861               Mean   : 942.1            
##  3rd Qu.: 678               3rd Qu.: 792.0            
##  Max.   :7417               Max.   :8011.0            
##  band 5 surface reflectance band 6 surface reflectance
##  Min.   :  93               Min.   :  50.0            
##  1st Qu.: 572               1st Qu.: 453.0            
##  Median :1095               Median : 659.0            
##  Mean   :1580               Mean   : 897.2            
##  3rd Qu.:2112               3rd Qu.:1199.0            
##  Max.   :7232               Max.   :4366.0            
##  band 7 surface reflectance      ndvi               ndwi        
##  Min.   :  13.0             Min.   :-0.36949   Min.   :-0.8667  
##  1st Qu.: 311.8             1st Qu.: 0.03926   1st Qu.:-0.6267  
##  Median : 609.0             Median : 0.08043   Median :-0.1940  
##  Mean   : 669.8             Mean   : 0.28911   Mean   :-0.3304  
##  3rd Qu.: 870.0             3rd Qu.: 0.60382   3rd Qu.:-0.1278  
##  Max.   :4308.0             Max.   : 1.00948   Max.   : 0.6117

Sampling recommendations

As we have learned, data is crucial for applying machine learning algorithms. In general terms, the sample should representatively describe the complexity of the problem that you are trying to solve. In other words, you should provide enough training data to capture the relationships between dependent and independent variables. Some points to consider:

  • The training data should be representative (include intra-class variations)
  • The training and validation data must be independent
  • Each class should be equally represented when possible
  • Keep in mind that every grid-cell is a training sample

Preparing training data

Now, we have to replace the IDs of the polygons from the training shapefile with the class values of the different land cover classes.

classes <- data.frame(training)

for(i in 1:nrow(training_df)){
  
  training_df$ID[i] <- classes[training_df$ID[i],2]
  
}
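Since the polygon IDs index the rows of `classes` directly, the loop above can be replaced by a single vectorized lookup. A sketch with toy stand-ins for the two data frames (the column names follow the shapefile’s attribute table):

```r
# Toy stand-ins: 'classes' has one row per polygon ID, as in the exercise
classes     <- data.frame(id = 1:3, Class = c(900, 100, 400))
training_df <- data.frame(ID = c(1, 1, 2, 3, 3))

# Vectorized lookup: index the Class column by the polygon IDs
training_df$ID <- classes$Class[training_df$ID]
training_df$ID  # 900 900 100 400 400
```

For a training frame with thousands of rows this is both faster and easier to read than the explicit loop.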

Preparing training data

For simplicity, let’s rename the predicted variable and the covariates.

names(training_df) <- c("class", paste0("band", 1:7), "ndvi", "ndwi")
str(training_df)
## 'data.frame':    8348 obs. of  10 variables:
##  $ class: num  900 900 900 900 900 900 900 900 900 900 ...
##  $ band1: num  415 571 392 253 384 386 639 417 388 244 ...
##  $ band2: num  554 710 496 318 517 491 787 471 445 317 ...
##  $ band3: num  822 974 717 496 799 ...
##  $ band4: num  1009 1168 903 646 1057 ...
##  $ band5: num  1131 1286 949 681 1186 ...
##  $ band6: num  1293 1397 1090 767 1293 ...
##  $ band7: num  994 1096 879 629 1028 ...
##  $ ndvi : num  0.057 0.0481 0.0248 0.0264 0.0575 ...
##  $ ndwi : num  -0.158 -0.138 -0.139 -0.157 -0.195 ...

Preparing training data

As observed, the class variable is numeric. To perform classification rather than regression, we have to convert this variable to a factor.

training_df$class <- as.factor(training_df$class)
str(training_df)
## 'data.frame':    8348 obs. of  10 variables:
##  $ class: Factor w/ 9 levels "100","200","300",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ band1: num  415 571 392 253 384 386 639 417 388 244 ...
##  $ band2: num  554 710 496 318 517 491 787 471 445 317 ...
##  $ band3: num  822 974 717 496 799 ...
##  $ band4: num  1009 1168 903 646 1057 ...
##  $ band5: num  1131 1286 949 681 1186 ...
##  $ band6: num  1293 1397 1090 767 1293 ...
##  $ band7: num  994 1096 879 629 1028 ...
##  $ ndvi : num  0.057 0.0481 0.0248 0.0264 0.0575 ...
##  $ ndwi : num  -0.158 -0.138 -0.139 -0.157 -0.195 ...
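With class now a factor, it is worth checking how balanced the training samples are (see the sampling recommendations above). A sketch with a toy factor standing in for training_df$class:

```r
# Toy class factor standing in for training_df$class
cls <- factor(c(900, 900, 100, 400, 400, 400), levels = c(100, 400, 900))

counts <- table(cls)
counts                        # number of cells per class
round(prop.table(counts), 2)  # relative share of each class
```

In the exercise, `table(training_df$class)` would reveal the strong imbalance visible in the RF confusion matrix (e.g. thousands of barren-land cells versus a handful of wetland cells).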

Training the models

Finally, we can train the models. We will evaluate RF and the four different kernels for SVM. This procedure may take a while.

rf_model       <- randomForest(class ~ ., data = training_df)
svm_model_lin  <- svm(class ~ ., data = training_df, kernel = "linear")
svm_model_poly <- svm(class ~ ., data = training_df, kernel = "polynomial")
svm_model_rad  <- svm(class ~ ., data = training_df, kernel = "radial")
svm_model_sig  <- svm(class ~ ., data = training_df, kernel = "sigmoid")

Random Forest

We can print the RF model to assess the classification:

print(rf_model)
## 
## Call:
##  randomForest(formula = class ~ ., data = training_df) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 1.27%
## Confusion matrix:
##      100  200 300  400 500 600 800  900 1000  class.error
## 100   47    0  18    0   0   0   0    0    0 0.2769230769
## 200    0 1308   0   14   0   0   0    0    0 0.0105900151
## 300    6    0 669    8   0   0   3    0    0 0.0247813411
## 400    0   19  12 1157   0   0   0    0    0 0.0260942761
## 500    0    0   0    1   6   0   0    2    0 0.3333333333
## 600    0    0   1    1   1 193   2    9    0 0.0676328502
## 800    0    0   3    0   0   3  66    0    0 0.0833333333
## 900    0    0   0    0   0   1   1 3992    0 0.0005007511
## 1000   0    0   0    0   0   0   0    1  804 0.0012422360

Random Forest

varImpPlot(rf_model)

Prediction

Now, we can predict the land cover classes over the Landsat image (this will take even more time).

# Replacing the names
names(covariates) <- c(paste0("band", 1:7), "ndvi", "ndwi")

rf_LC       <- predict(covariates, rf_model)
svm_LC_lin  <- predict(covariates, svm_model_lin)
svm_LC_poly <- predict(covariates, svm_model_poly)
svm_LC_rad  <- predict(covariates, svm_model_rad)
svm_LC_sig  <- predict(covariates, svm_model_sig)

Visualising the classification

(Figures: the predicted land cover maps from RF and the four SVM kernels, one per slide.)

Wait a minute… which model performed the best?

Verification

To evaluate the performance of the machine learning algorithms we will use the verification_sites shapefile.

verification <- "C:/User/Shapefiles/verification_sites_LCC.shp"
verification <- vect(verification)
print(verification)
##  class       : SpatVector 
##  geometry    : polygons 
##  dimensions  : 27, 2  (geometries, attributes)
##  extent      : 274178.3, 304992.4, -4274494, -4249372  (xmin, xmax, ymin, ymax)
##  source      : verification_sites_LCC.shp
##  coord. ref. : WGS 84 / UTM zone 19N (EPSG:32619) 
##  names       :    id Class
##  type        : <int> <int>
##  values      :     1   100
##                    2   100
##                    3   100

Loading verification set

Let’s visualise the attribute table of the verification shapefile.

data.frame(verification)
##    id Class
## 1   1   100
## 2   2   100
## 3   3   100
## 4   4   200
## 5   5   200
## 6   6   200
## 7   7   300
## 8   8   300
## 9   9   300
## 10 10   400
## 11 11   400
## 12 12   400
## 13 13   500
## 14 14   500
## 15 15   500
## 16 16   600
## 17 17   600
## 18 18   600
## 19 19   800
## 20 20   800
## 21 21   800
## 22 22   900
## 23 23   900
## 24 24   900
## 25 25  1000
## 26 26  1000
## 27 27  1000

Stacking predictions

Let’s stack the predictions of the different models.

results <- rf_LC
add(results) <- c(svm_LC_lin, svm_LC_poly, svm_LC_rad, svm_LC_sig)
print(results)
## class       : SpatRaster 
## dimensions  : 893, 1032, 5  (nrow, ncol, nlyr)
## resolution  : 30, 30  (x, y)
## extent      : 274095, 305055, -4274535, -4247745  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 19N (EPSG:32619) 
## sources     : memory  
##               memory  
##               memory  
##               ... and 2 more source(s)
## names       : lyr1, lyr1, lyr1, lyr1, lyr1 
## min values  :    1,    1,    1,    1,    1 
## max values  :    9,    9,    9,    9,    9

Reclassifying predictions

Let’s reclassify the results stack with a matrix that maps the predicted factor levels (1-9) back to the land cover class codes.

rcl <- matrix(c(1:9, 100, 200, 300, 400, 500, 600, 800, 900, 1000), byrow = FALSE, ncol = 2)
print(rcl)
##       [,1] [,2]
##  [1,]    1  100
##  [2,]    2  200
##  [3,]    3  300
##  [4,]    4  400
##  [5,]    5  500
##  [6,]    6  600
##  [7,]    7  800
##  [8,]    8  900
##  [9,]    9 1000
results <- classify(results, rcl)
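The rcl matrix above hard-codes the level-to-class mapping. It can also be derived from the factor levels of the training data, which guards against a mismatch if the set of classes ever changes. A sketch with a stand-in factor carrying the same levels as training_df$class:

```r
# Stand-in factor with the same levels as training_df$class
lvls <- c("100", "200", "300", "400", "500", "600", "800", "900", "1000")
cls  <- factor(lvls, levels = lvls)

# Column 1: level index used in the predictions; column 2: class code
rcl2 <- cbind(seq_along(levels(cls)), as.numeric(levels(cls)))
rcl2
```

Note that the factor levels are sorted as character strings, which is why level 7 corresponds to class 800 (there is no class 700).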

Reclassifying predictions

Let’s print the reclassified results stack.

print(results)
## class       : SpatRaster 
## dimensions  : 893, 1032, 5  (nrow, ncol, nlyr)
## resolution  : 30, 30  (x, y)
## extent      : 274095, 305055, -4274535, -4247745  (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 19N (EPSG:32619) 
## source      : memory 
## names       : lyr1, lyr1, lyr1, lyr1, lyr1 
## min values  :  100,  100,  100,  100,  100 
## max values  : 1000, 1000, 1000, 1000, 1000

Visualising the predictions

Let’s plot the predictions of the different models.

plot(results)

Preparing verification set

We will use the extract function to prepare the verification data.frame as we did for the training set.

verification_df <- terra::extract(results, verification)
head(verification_df)
##   ID lyr1 lyr1 lyr1 lyr1 lyr1
## 1  1  100  300  300  300  300
## 2  1  100  300  300  300  300
## 3  1  100  300  300  300  300
## 4  1  100  300  300  300  300
## 5  1  100  300  300  300  300
## 6  1  100  300  300  300  300

Preparing verification set

Now, we have to replace the IDs of the polygons from the verification shapefile with the class values of the different land cover classes.

classes <- data.frame(verification)

for(i in 1:nrow(verification_df)){
  
  verification_df$ID[i] <- classes[verification_df$ID[i],2]
  
}

names(verification_df) <- c("class", "RF", "SVM_lin", "SVM_poly",
                            "SVM_rad", "SVM_sig")

head(verification_df)
##   class  RF SVM_lin SVM_poly SVM_rad SVM_sig
## 1   100 100     300      300     300     300
## 2   100 100     300      300     300     300
## 3   100 100     300      300     300     300
## 4   100 100     300      300     300     300
## 5   100 100     300      300     300     300
## 6   100 100     300      300     300     300
tail(verification_df)
##      class   RF SVM_lin SVM_poly SVM_rad SVM_sig
## 3095  1000 1000    1000     1000    1000    1000
## 3096  1000 1000    1000     1000    1000    1000
## 3097  1000 1000    1000     1000    1000    1000
## 3098  1000 1000    1000     1000    1000    1000
## 3099  1000 1000    1000     1000    1000    1000
## 3100  1000 1000    1000     1000    1000    1000
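Before turning to confusionMatrix, the overall accuracy of each model can be computed directly as the share of correctly classified cells. A sketch with a toy verification frame (four cells, two models):

```r
# Toy verification frame: observed class plus two model predictions
verif <- data.frame(class = c(100, 100, 200, 900),
                    RF    = c(100, 300, 200, 900),
                    SVM   = c(100, 100, 200, 300))

# Overall accuracy = share of correctly classified cells per model
acc <- sapply(verif[-1], function(pred) mean(pred == verif$class))
acc  # RF 0.75, SVM 0.75
```

Applied to verification_df, `sapply(verification_df[-1], function(p) mean(p == verification_df$class))` gives a quick one-line ranking of the five models before inspecting the full per-class statistics.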

Evaluation of performance

Finally, we can apply a confusion matrix to assess the accuracy of each method. For this purpose, we will use the confusionMatrix function from the caret package.

library(caret)

confusionMatrix(data = as.factor(verification_df$RF), 
                reference = as.factor(verification_df$class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 100 200 300 400 500 600 800 900 1000
##       100   17   0  51   0   1   0   2   0    0
##       200    0 934   0   5   0   0   0   0    0
##       300   22   0 253   2   2   0   3   0    0
##       400    0  67  14 393   5   2   0   0    0
##       500    0   0   0   0   1   2   0   0    0
##       600    0   0   0   0   0  31   0   4    9
##       800    0   0   5   0   0   0  28   1    0
##       900    0   0   0   0   0   6   3 932    3
##       1000   0   0   0   0   0   0   0   0  302
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9326          
##                  95% CI : (0.9232, 0.9412)
##     No Information Rate : 0.3229          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9125          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 100 Class: 200 Class: 300 Class: 400 Class: 500
## Sensitivity            0.435897     0.9331    0.78328     0.9825  0.1111111
## Specificity            0.982359     0.9976    0.98956     0.9674  0.9993530
## Pos Pred Value         0.239437     0.9947    0.89716     0.8170  0.3333333
## Neg Pred Value         0.992737     0.9690    0.97516     0.9973  0.9974169
## Prevalence             0.012581     0.3229    0.10419     0.1290  0.0029032
## Detection Rate         0.005484     0.3013    0.08161     0.1268  0.0003226
## Detection Prevalence   0.022903     0.3029    0.09097     0.1552  0.0009677
## Balanced Accuracy      0.709128     0.9653    0.88642     0.9750  0.5552320
##                      Class: 600 Class: 800 Class: 900 Class: 1000
## Sensitivity             0.75610   0.777778     0.9947     0.96178
## Specificity             0.99575   0.998042     0.9945     1.00000
## Pos Pred Value          0.70455   0.823529     0.9873     1.00000
## Neg Pred Value          0.99673   0.997391     0.9977     0.99571
## Prevalence              0.01323   0.011613     0.3023     0.10129
## Detection Rate          0.01000   0.009032     0.3006     0.09742
## Detection Prevalence    0.01419   0.010968     0.3045     0.09742
## Balanced Accuracy       0.87592   0.887910     0.9946     0.98089

Evaluation of performance

confusionMatrix(data = as.factor(verification_df$SVM_lin), 
                reference = as.factor(verification_df$class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 100 200 300 400 500 600 800 900 1000
##       100    1   0   0   0   2   0   0   0    0
##       200    0 958   0   0   1   0   0   0    0
##       300   30   0 315   2   1   0   3   0    0
##       400    3  42   8 398   1   2   0   0    0
##       500    0   1   0   0   3   3   0   0    0
##       600    0   0   0   0   0  12   0   0    0
##       800    0   0   0   0   0   0  28   2    0
##       900    5   0   0   0   1   5   5 935    7
##       1000   0   0   0   0   0  19   0   0  307
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9539         
##                  95% CI : (0.9459, 0.961)
##     No Information Rate : 0.3229         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9397         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: 100 Class: 200 Class: 300 Class: 400 Class: 500
## Sensitivity           0.0256410     0.9570     0.9752     0.9950  0.3333333
## Specificity           0.9993466     0.9995     0.9870     0.9793  0.9987059
## Pos Pred Value        0.3333333     0.9990     0.8974     0.8767  0.4285714
## Neg Pred Value        0.9877301     0.9799     0.9971     0.9992  0.9980601
## Prevalence            0.0125806     0.3229     0.1042     0.1290  0.0029032
## Detection Rate        0.0003226     0.3090     0.1016     0.1284  0.0009677
## Detection Prevalence  0.0009677     0.3094     0.1132     0.1465  0.0022581
## Balanced Accuracy     0.5124938     0.9783     0.9811     0.9871  0.6660196
##                      Class: 600 Class: 800 Class: 900 Class: 1000
## Sensitivity            0.292683   0.777778     0.9979     0.97771
## Specificity            1.000000   0.999347     0.9894     0.99318
## Pos Pred Value         1.000000   0.933333     0.9760     0.94172
## Neg Pred Value         0.990609   0.997394     0.9991     0.99748
## Prevalence             0.013226   0.011613     0.3023     0.10129
## Detection Rate         0.003871   0.009032     0.3016     0.09903
## Detection Prevalence   0.003871   0.009677     0.3090     0.10516
## Balanced Accuracy      0.646341   0.888563     0.9936     0.98544

Evaluation of performance

confusionMatrix(data = as.factor(verification_df$SVM_poly), 
                reference = as.factor(verification_df$class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 100 200 300 400 500 600 800 900 1000
##       100    1   0   0   0   0   0   0   0    0
##       200    0 941   0   0   0   0   0   0    0
##       300   25   0 298   2   0   0   3   0    0
##       400    4  60  16 398   5   3   0   0    0
##       500    0   0   0   0   0   0   0   0    0
##       600    0   0   0   0   0  11   0   0    1
##       800    0   0   0   0   0   0  20   9    0
##       900    9   0   9   0   4   8  13 928   16
##       1000   0   0   0   0   0  19   0   0  297
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9335          
##                  95% CI : (0.9242, 0.9421)
##     No Information Rate : 0.3229          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.913           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 100 Class: 200 Class: 300 Class: 400 Class: 500
## Sensitivity           0.0256410     0.9401    0.92260     0.9950   0.000000
## Specificity           1.0000000     1.0000    0.98920     0.9674   1.000000
## Pos Pred Value        1.0000000     1.0000    0.90854     0.8189        NaN
## Neg Pred Value        0.9877380     0.9722    0.99098     0.9992   0.997097
## Prevalence            0.0125806     0.3229    0.10419     0.1290   0.002903
## Detection Rate        0.0003226     0.3035    0.09613     0.1284   0.000000
## Detection Prevalence  0.0003226     0.3035    0.10581     0.1568   0.000000
## Balanced Accuracy     0.5128205     0.9700    0.95590     0.9812   0.500000
##                      Class: 600 Class: 800 Class: 900 Class: 1000
## Sensitivity            0.268293   0.555556     0.9904     0.94586
## Specificity            0.999673   0.997063     0.9727     0.99318
## Pos Pred Value         0.916667   0.689655     0.9402     0.93987
## Neg Pred Value         0.990285   0.994790     0.9957     0.99389
## Prevalence             0.013226   0.011613     0.3023     0.10129
## Detection Rate         0.003548   0.006452     0.2994     0.09581
## Detection Prevalence   0.003871   0.009355     0.3184     0.10194
## Balanced Accuracy      0.633983   0.776309     0.9816     0.96952

Evaluation of performance

confusionMatrix(data = as.factor(verification_df$SVM_rad), 
                reference = as.factor(verification_df$class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 100 200 300 400 500 600 800 900 1000
##       100    0   0   0   0   0   0   0   0    0
##       200    0 923   0   0   1   0   0   0    0
##       300   39   0 317   2   3   0   6   0    0
##       400    0  78   6 398   3   3   0   0    0
##       500    0   0   0   0   1   0   0   0    0
##       600    0   0   0   0   0  30   0   0    4
##       800    0   0   0   0   0   0  26   1    0
##       900    0   0   0   0   1   8   4 936    7
##       1000   0   0   0   0   0   0   0   0  303
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9465          
##                  95% CI : (0.9379, 0.9541)
##     No Information Rate : 0.3229          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9303          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 100 Class: 200 Class: 300 Class: 400 Class: 500
## Sensitivity             0.00000     0.9221     0.9814     0.9950  0.1111111
## Specificity             1.00000     0.9995     0.9820     0.9667  1.0000000
## Pos Pred Value              NaN     0.9989     0.8638     0.8156  1.0000000
## Neg Pred Value          0.98742     0.9642     0.9978     0.9992  0.9974185
## Prevalence              0.01258     0.3229     0.1042     0.1290  0.0029032
## Detection Rate          0.00000     0.2977     0.1023     0.1284  0.0003226
## Detection Prevalence    0.00000     0.2981     0.1184     0.1574  0.0003226
## Balanced Accuracy       0.50000     0.9608     0.9817     0.9808  0.5555556
##                      Class: 600 Class: 800 Class: 900 Class: 1000
## Sensitivity            0.731707   0.722222     0.9989     0.96497
## Specificity            0.998692   0.999674     0.9908     1.00000
## Pos Pred Value         0.882353   0.962963     0.9791     1.00000
## Neg Pred Value         0.996412   0.996746     0.9995     0.99607
## Prevalence             0.013226   0.011613     0.3023     0.10129
## Detection Rate         0.009677   0.008387     0.3019     0.09774
## Detection Prevalence   0.010968   0.008710     0.3084     0.09774
## Balanced Accuracy      0.865200   0.860948     0.9948     0.98248

Evaluation of performance

confusionMatrix(data = as.factor(verification_df$SVM_sig), 
                reference = as.factor(verification_df$class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 100 200 300 400 500 600 800 900 1000
##       100    1   0  26   0   0   0  16 110    0
##       200    0 872   0   0   5   1   0   0    0
##       300   19   0 217   9   0   0   3  22    0
##       400    0 129  38 366   1   6   1   0    0
##       500    0   0   0   0   0   0   0   0    0
##       600    0   0   0   0   0  14   0 218   29
##       800    0   0   0   0   0   0   0   0    0
##       900   19   0  42  25   3   1  16 587    1
##       1000   0   0   0   0   0  19   0   0  284
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7552          
##                  95% CI : (0.7396, 0.7702)
##     No Information Rate : 0.3229          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6931          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 100 Class: 200 Class: 300 Class: 400 Class: 500
## Sensitivity           0.0256410     0.8711     0.6718     0.9150   0.000000
## Specificity           0.9503430     0.9971     0.9809     0.9352   1.000000
## Pos Pred Value        0.0065359     0.9932     0.8037     0.6765        NaN
## Neg Pred Value        0.9871055     0.9419     0.9625     0.9867   0.997097
## Prevalence            0.0125806     0.3229     0.1042     0.1290   0.002903
## Detection Rate        0.0003226     0.2813     0.0700     0.1181   0.000000
## Detection Prevalence  0.0493548     0.2832     0.0871     0.1745   0.000000
## Balanced Accuracy     0.4879920     0.9341     0.8264     0.9251   0.500000
##                      Class: 600 Class: 800 Class: 900 Class: 1000
## Sensitivity            0.341463    0.00000     0.6265     0.90446
## Specificity            0.919255    1.00000     0.9505     0.99318
## Pos Pred Value         0.053640        NaN     0.8458     0.93729
## Neg Pred Value         0.990490    0.98839     0.8545     0.98927
## Prevalence             0.013226    0.01161     0.3023     0.10129
## Detection Rate         0.004516    0.00000     0.1894     0.09161
## Detection Prevalence   0.084194    0.00000     0.2239     0.09774
## Balanced Accuracy      0.630359    0.50000     0.7885     0.94882

Exercise - machine learning for spatial prediction of variables

pH prediction

This exercise aims to predict pH over the same area where we applied the land cover classification. For this purpose, we will use the SoilGrids dataset. Specifically:

  • Organic carbon density (g dm\(^{-3}\))
  • Soil organic carbon stock (t ha\(^{-1}\))
  • Bulk density (cg cm\(^{-3}\))
  • Clay content (g kg\(^{-1}\))

pH prediction

  • Coarse fragments (cm\(^{3}\) dm\(^{-3}\))
  • Sand (g kg\(^{-1}\))
  • Silt (g kg\(^{-1}\))
  • Cation exchange capacity (at pH 7) (mmol(c) kg\(^{-1}\))
  • Nitrogen (cg kg\(^{-1}\))
  • Soil organic carbon (dg kg\(^{-1}\))

Importing sample sites

First, let’s import the 20,000 pH sample sites.

ph_data <- "C:/Data/L4_Machine_Learning_II/pH_data/pH_data.csv"
ph_data <- read.csv(ph_data)
head(ph_data)
##         Lon       Lat  pH
## 1 -71.86086 -38.54241 5.3
## 2 -71.95136 -37.23059 6.0
## 3 -71.69570 -38.36320 5.4
## 4 -71.32919 -37.96894 5.9
## 5 -71.43552 -38.18160 5.8
## 6 -71.04186 -37.16368 5.9
dim(ph_data)
## [1] 20000     3

Preparing training and verification sets

We will select 80% of the sample sites for training and keep the remaining 20% for verification, setting a seed so the split is reproducible.

set.seed(28)

pos <- sample(1:20000, round(20000 * 0.8))
training     <- ph_data[pos,]
verification <- ph_data[-pos,]
dim(training)
## [1] 16000     3
dim(verification)
## [1] 4000    3
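A quick sanity check that the split is disjoint and exhaustive, sketched on a small index vector (n = 100 instead of 20,000):

```r
set.seed(28)
n   <- 100
pos <- sample(1:n, round(n * 0.8))

train_idx <- pos
verif_idx <- setdiff(1:n, pos)

length(train_idx)                        # 80
length(verif_idx)                        # 20
length(intersect(train_idx, verif_idx))  # 0 -> no leakage between the sets
```

Keeping the two sets strictly disjoint is what makes the verification error an honest estimate of performance on unseen locations.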

Importing covariates

Let’s import the covariates that will be used in the training and prediction steps.

files <- "C:/Data/L4_Machine_Learning_II/SoilGrids/"
files <- list.files(files, full.names = TRUE)
covariates <- rast(files)

Convert training sites to a spatial object

We have to extract the covariate values at the training sites. For this, we need the sites as a vector layer.

training_points <- vect(training, geom=c("Lon", "Lat"), crs = crs(covariates))
print(training_points)
##  class       : SpatVector 
##  geometry    : points 
##  dimensions  : 16000, 1  (geometries, attributes)
##  extent      : -71.99887, -71.00113, -38.99881, -37.00119  (xmin, xmax, ymin, ymax)
##  coord. ref. : lon/lat WGS 84 (EPSG:4326) 
##  names       :    pH
##  type        : <num>
##  values      :   5.5
##                  5.9
##                  5.3

Extract covariates

We will use the extract function from the terra package to generate our training data frame.

training_df <- terra::extract(covariates, training_points)
head(training_df)
##   ID Bulk_density_0_5cm Cation_exchange_capacity_0_5cm Clay_content_0_5cm
## 1  1                 85                            356                282
## 2  2                 93                            318                230
## 3  3                 87                            332                291
## 4  4                 86                            335                244
## 5  5                 95                            296                173
## 6  6                 94                            340                287
##   Coarse_fragments_0_5cm Nitrogen_0_5cm Organic_carbon_density_0_5cm Sand_0_5cm
## 1                    108            626                          710        298
## 2                    205            737                          713        391
## 3                    109            565                          732        351
## 4                    133            626                          655        306
## 5                     96            625                          662        493
## 6                     87            527                          740        251
##   Silt_0_5cm Soil_organic_carbon_0_5cm Soil_organic_carbon_stock_0_30cm
## 1        420                      1517                              124
## 2        379                      1713                              115
## 3        358                      1528                              123
## 4        451                      1494                              100
## 5        333                      1369                               84
## 6        462                      1390                              120

Extract covariates

Let’s replace the ID column of the data frame (which we don’t need) with the pH values.

training_df$ID <- training$pH
names(training_df)[1] <- "pH"
head(training_df)
##    pH Bulk_density_0_5cm Cation_exchange_capacity_0_5cm Clay_content_0_5cm
## 1 5.5                 85                            356                282
## 2 5.9                 93                            318                230
## 3 5.3                 87                            332                291
## 4 5.5                 86                            335                244
## 5 6.1                 95                            296                173
## 6 5.5                 94                            340                287
##   Coarse_fragments_0_5cm Nitrogen_0_5cm Organic_carbon_density_0_5cm Sand_0_5cm
## 1                    108            626                          710        298
## 2                    205            737                          713        391
## 3                    109            565                          732        351
## 4                    133            626                          655        306
## 5                     96            625                          662        493
## 6                     87            527                          740        251
##   Silt_0_5cm Soil_organic_carbon_0_5cm Soil_organic_carbon_stock_0_30cm
## 1        420                      1517                              124
## 2        379                      1713                              115
## 3        358                      1528                              123
## 4        451                      1494                              100
## 5        333                      1369                               84
## 6        462                      1390                              120

Train the models

Now we have everything we need to train the models.

# Training (this will take time)
rf_model       <- randomForest(pH ~ ., data = training_df)
svm_model_lin  <- svm(pH ~ ., data = training_df, kernel = "linear")
svm_model_poly <- svm(pH ~ ., data = training_df, kernel = "polynomial")
svm_model_rad  <- svm(pH ~ ., data = training_df, kernel = "radial")

Importance of variables for RF

We can use the varImpPlot function to assess the importance of the covariates.

varImpPlot(rf_model)
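The importance plot itself is not reproduced in this handout. If a numeric ranking is preferred, the importance() function returns the raw scores behind varImpPlot. A minimal self-contained sketch with synthetic data (the variable names and the data frame are illustrative, not the course dataset):

```r
library(randomForest)

set.seed(28)
# Synthetic training frame: pH driven mainly by "clay"
df <- data.frame(clay = runif(300, 50, 400),
                 sand = runif(300, 100, 600))
df$pH <- 5 + 0.004 * df$clay + rnorm(300, sd = 0.1)

rf <- randomForest(pH ~ ., data = df)

# For regression forests importance() reports IncNodePurity
imp <- importance(rf)
imp[order(imp[, "IncNodePurity"], decreasing = TRUE), , drop = FALSE]
```

Sorting the matrix gives the same ranking that varImpPlot displays graphically.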

Prediction

Now we can spatially predict pH over the study area.

# Prediction
rf_LC       <- predict(covariates, rf_model)
svm_LC_lin  <- predict(covariates, svm_model_lin)
svm_LC_poly <- predict(covariates, svm_model_poly)
svm_LC_rad  <- predict(covariates, svm_model_rad)

Visualising the prediction

(The original slides showed four maps here: the RF, SVM linear, SVM polynomial, and SVM radial predictions.)
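The four maps are not reproduced in this handout. A plausible reconstruction of the plotting calls (assuming the prediction objects from the previous step), followed by a self-contained demonstration on a synthetic raster:

```r
library(terra)

# With the objects from the prediction step, each slide's map would be:
#   plot(rf_LC,       main = "Random forest")
#   plot(svm_LC_lin,  main = "SVM - linear")
#   plot(svm_LC_poly, main = "SVM - polynomial")
#   plot(svm_LC_rad,  main = "SVM - radial")

# Self-contained demonstration with a synthetic pH-like surface
r <- rast(nrows = 90, ncols = 90,
          xmin = -72, xmax = -71, ymin = -39, ymax = -37,
          crs = "EPSG:4326")
values(r) <- runif(ncell(r), 4.5, 7.5)
plot(r, main = "Synthetic pH surface")
```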

Stacking results

Now, we can create a stack of the predicted models to later use them for verification purposes.

results <- rf_LC
add(results) <- c(svm_LC_lin, svm_LC_poly, svm_LC_rad)
names(results) <- c("RF", "SVM_lin", "SVM_poly", "SVM_rad")
print(results)
## class       : SpatRaster 
## dimensions  : 837, 442, 4  (nrow, ncol, nlyr)
## resolution  : 0.002262443, 0.002389486  (x, y)
## extent      : -72, -71, -39, -37  (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326) 
## sources     : memory  
##               memory  
##               memory  
##               ... and 1 more source(s)
## names       :            RF,       SVM_lin,      SVM_poly,       SVM_rad 
## min values  :  2.564704e-14,  9.081915e-02, -9.031603e-02,  9.088693e-02 
## max values  :      6.437633,      6.373022,      7.265613,      6.438758
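As a design note, the add(&lt;-) pattern above appends layers to an existing SpatRaster in place; the same stack can be built in a single call with c(). A minimal sketch with synthetic single-layer rasters standing in for two of the predictions:

```r
library(terra)

# Two single-layer rasters standing in for two model predictions
a <- rast(nrows = 10, ncols = 10, vals = 1)
b <- rast(nrows = 10, ncols = 10, vals = 2)

# c() combines the layers into one multi-layer SpatRaster
s <- c(a, b)
names(s) <- c("RF", "SVM_lin")
nlyr(s)
```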

Verification

We have to generate a spatial object with the verification sites as well.

verification_points <- vect(verification, geom=c("Lon", "Lat"), crs = crs(covariates))

Now, we can extract the pH prediction over the verification sites.

verification_df <- terra::extract(results, verification_points)
head(verification_df)
##   ID       RF  SVM_lin SVM_poly  SVM_rad
## 1  1 5.964537 6.158771 5.950130 5.918034
## 2  2 6.101903 6.165129 6.089038 6.046751
## 3  3 5.476773 5.474062 5.329843 5.415686
## 4  4 5.978840 5.965380 5.855315 5.953640
## 5  5 5.911273 5.771236 5.868699 5.866439
## 6  6 5.873360 5.862882 5.888656 5.881538

Preparing verification dataset

Similarly to what we did for the training data, we have to replace the ID values with the pH ones.

verification_df$ID <- verification$pH
names(verification_df)[1] <- "pH"
head(verification_df)
##    pH       RF  SVM_lin SVM_poly  SVM_rad
## 1 6.0 5.964537 6.158771 5.950130 5.918034
## 2 6.0 6.101903 6.165129 6.089038 6.046751
## 3 5.3 5.476773 5.474062 5.329843 5.415686
## 4 6.0 5.978840 5.965380 5.855315 5.953640
## 5 5.9 5.911273 5.771236 5.868699 5.866439
## 6 5.8 5.873360 5.862882 5.888656 5.881538

Evaluation

Finally, we can assess the performance of the models. For this purpose, we will use the rmse function from the hydroGOF package.

## Evaluation RMSE
library(hydroGOF)
rmse(obs = verification_df$pH, sim = verification_df$RF)
## [1] 0.09447623
rmse(obs = verification_df$pH, sim = verification_df$SVM_lin)
## [1] 0.1340951
rmse(obs = verification_df$pH, sim = verification_df$SVM_poly)
## [1] 0.1413756
rmse(obs = verification_df$pH, sim = verification_df$SVM_rad)
## [1] 0.1037692
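If hydroGOF is not installed, the same statistic can be computed in base R. A minimal sketch (the rmse_manual helper and the illustrative vectors are ours, not part of any package):

```r
# RMSE: square root of the mean squared difference
rmse_manual <- function(obs, sim) sqrt(mean((sim - obs)^2))

obs <- c(6.0, 6.0, 5.3)        # illustrative pH observations
sim <- c(5.96, 6.10, 5.48)     # illustrative predictions
rmse_manual(obs, sim)          # about 0.121
```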